Abstract

NEON provides ecological data that span a broad range of ecosystems and taxa. In this report we examine how the data for one site, Great Basin, Onaqui, Utah, USA, and for one phylum, Methylomirabilota, were collected. We then pose a set of site-specific and phylum-specific questions to interpret these data. The results include graphs, tables and phylogenetic trees that present the NEON data visually. With these visualizations we analyze the kinds of ecosystems sampled in the USA and how the phyla found in them relate to the sites.

Motivating Reasons

NEON is an observational facility whose purpose is to collect ecological data. These data are used to characterize the ecosystems of the United States and, because they are collected over time, to show how those ecosystems are changing and what may be driving the change. NEON’s longer-term aim is to help sustain these ecosystems, working with its partners and community. From the data NEON has collected, we selected the records for Great Basin, Onaqui, Utah, USA and for Methylomirabilota. We turn these data into visualizations so that the findings are easier to understand and more accessible. We then analyze the tables and graphs and make claims about the site and the phylum and how they relate to each other.

Introduction

NEON is the National Ecological Observatory Network. Its purpose is to collect ecological data across different sites and taxonomic ranks in order to understand how ecosystems change over time. We focus on the terrestrial site Great Basin, Onaqui, Utah, USA. For this site, we look at which phyla, classes and families of bacteria are present and how they relate to the ecology of the location. We also examine the phylum Methylomirabilota and analyze where it occurs across the sampled locations.

Onaqui is a terrestrial field site located about 50 miles southwest of Salt Lake City. Its climate is described as arid, with little precipitation, hot summers and cold winters. Several soil series have been recorded at the site, including Taylorsflat, Sterling, Sevy, Strevell and many more. Vegetation is found predominantly on the eastern side of the site, at the base of the mountains and up into the woodlands. The fauna includes coyotes, jackrabbits, rattlesnakes and other small mammals and birds. The land is managed by the Bureau of Land Management, which allows several uses of the site, including data collection, recreation and hunting.

Methylomirabilota, also known as NC10, is a phylum of bacteria. It is known for its biogeochemical impact in the habitats where it is found, although much about the phylum remains to be discovered. Its main known function is the oxidation of methane coupled to denitrification. In the best-studied members, this oxidation uses oxygen that the cell generates internally from nitrite, so it can proceed in oxygen-poor environments. The phylum is found in a diverse range of habitats, as highlighted in the Results section. Methylomirabilota is important because it contributes to methane regulation in the ecosystem.

Methods

NEON used several data collection methods to obtain the samples for each site and phylum. For each site, data were collected on weather, climate, land cover and the species within the ecosystem. The three collection methods were airborne remote sensing, automated instruments and observational sampling. Airborne remote sensing used spectrometers, digital cameras, lidar, GPS and an inertial measurement unit. Automated instruments were used to sample soil, surface water and groundwater in order to examine patterns and the bacteria found in these locations. Finally, observational sampling was split into aquatic and terrestrial observations, in which species diversity and environmental or chemical properties could be examined. We retrieved the data for our site and phylum from the NEON website and saved them as CSV files so that they could be imported into R Markdown. There the data were reshaped into graphs and tables, guided by site-specific and phylum-specific questions. These questions are listed below.
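
The import-and-tidy step described above can be sketched on a toy table. This is a minimal illustration, not the pipeline used in the report: the two genome names below are hypothetical strings written in the style that the report's cleaning code expects, and `parsed` is a name introduced only for this example.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical genome names (illustrative only, not taken from the NEON export)
toy <- tibble(
  `Genome Name` = c(
    "Terrestrial soil microbial communities from Great Basin, Onaqui, Utah, USA - ONAQ_004-M-20210623-comp-1",
    "NEON combined assembly"
  )
)

parsed <- toy %>%
  # Label each row as a combined or an individual assembly
  mutate(`Assembly Type` = if_else(`Genome Name` == "NEON combined assembly",
                                   "Combined", "Individual")) %>%
  # Remove the common prefix, then split into site name and sample name
  mutate(`Genome Name` = str_remove(`Genome Name`,
                                    "^Terrestrial soil microbial communities from ")) %>%
  separate(`Genome Name`, c("Site", "Sample Name"), sep = " - ", fill = "right") %>%
  # Remove the common suffix, then split the sample name into its parts
  mutate(`Sample Name` = str_remove(`Sample Name`, "-comp-1$")) %>%
  separate(`Sample Name`, c("Site ID", "subplot.layer.date"),
           sep = "_", remove = FALSE, fill = "right") %>%
  separate(subplot.layer.date, c("Subplot", "Layer", "Date"),
           sep = "-", fill = "right")

parsed
```

Setting `fill = "right"` makes the missing pieces of the combined-assembly row explicit `NA` values without a warning, which is one way to keep the knitted output quieter.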

Questions

Site Specific:

  • Which MAGs (metagenome-assembled genomes) are found within the subplots of Onaqui?

  • What is the taxonomic breakdown at Onaqui?

  • Are there any novel bacteria found in Onaqui?

  • What is the correlation of site/ecosystem subtype to soil temperature and soil pH?

Phylum Specific:

  • Where in the US is the phylum Methylomirabilota found?

  • Is Methylomirabilota found in our selected site?

  • Where is each order found?

  • What are the individual and co-assemblies for Methylomirabilota?

  • What is the soil temperature for this phylum?

  • What are the ecosystem subtypes for Methylomirabilota?

Results

library(tidyverse)
library(plotly) 
library(knitr) 
library(DT)

Site Specific

Question: Which MAGs are found within the subplots of Onaqui?

NEON_MAGs <- read_csv("data/NEON/GOLD_Study_ID_Gs0161344_NEON_edArchaea.csv") %>% 
  # remove columns that are not needed for data analysis
  select(-c(`GOLD Study ID`, `Bin Methods`, `Created By`, `Date Added`)) %>% 
  # create a new column with the Assembly Type
  mutate("Assembly Type" = case_when(`Genome Name` == "NEON combined assembly" ~ `Genome Name`,
                            TRUE ~ "Individual")) %>% 
  mutate_at("Assembly Type", str_replace, "NEON combined assembly", "Combined") %>% 
  separate(`GTDB-Tk Taxonomy Lineage`, c("Domain", "Phylum", "Class", "Order", "Family", "Genus"), "; ", remove = FALSE) %>% 
  # Remove the common prefix "Terrestrial soil microbial communities from "
  mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>% 
  # Split the column in two at the first " - "
  separate(`Genome Name`, c("Site","Sample Name"), " - ") %>% 
  # Remove the common suffix "-comp-1"
  mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
  # separate the Sample Name into Site ID and plot info
  separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE,) %>% 
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-") 
## Rows: 1754 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (8): Bin ID, Genome Name, Bin Quality, Bin Lineage, GTDB-Tk Taxonomy L...
## dbl  (10): IMG Genome ID, Bin Completeness, Bin Contamination, Total Number ...
## date  (1): Date Added
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: Expected 6 pieces. Additional pieces discarded in 46 rows [3, 4, 24, 25, 26,
## 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 54, 232, 267, ...].
## Warning: Expected 6 pieces. Missing pieces filled with `NA` in 446 rows [1, 2, 9, 10,
## 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 46, 50, 53, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 624 rows [4, 7, 8, 236,
## 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252,
## ...].
read_tsv("data/NEON/exported_img_data.tsv")
## Rows: 176 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (8): Domain, Sequencing Status, Study Name, Genome Name / Sample Name, S...
## dbl (4): taxon_oid, IMG Genome ID, Genome Size  * assembled, Gene Count  * a...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 176 × 12
##     taxon_oid Domain     `Sequencing Status` `Study Name` Genome Name / Sample…¹
##         <dbl> <chr>      <chr>               <chr>        <chr>                 
##  1 3300069219 *Microbio… Permanent Draft     Terrestrial… Terrestrial soil micr…
##  2 3300069216 *Microbio… Permanent Draft     Terrestrial… Terrestrial soil micr…
##  3 3300062116 *Microbio… Permanent Draft     Terrestrial… Terrestrial soil micr…
##  4 3300060668 *Microbio… Permanent Draft     Terrestrial… Terrestrial soil micr…
##  5 3300060914 *Microbio… Permanent Draft     Terrestrial… Terrestrial soil micr…
##  6 3300069208 *Microbio… Permanent Draft     Terrestrial… Terrestrial soil micr…
##  7 3300067032 *Microbio… Permanent Draft     Terrestrial… NEON combined assembly
##  8 3300061641 *Microbio… Permanent Draft     Terrestrial… Terrestrial soil micr…
##  9 3300069224 *Microbio… Permanent Draft     Terrestrial… Terrestrial soil micr…
## 10 3300069268 *Microbio… Permanent Draft     Terrestrial… Terrestrial soil micr…
## # ℹ 166 more rows
## # ℹ abbreviated name: ¹​`Genome Name / Sample Name`
## # ℹ 7 more variables: `Sequencing Center` <chr>, `IMG Genome ID` <dbl>,
## #   `GOLD Study ID` <chr>, Latitude <chr>, Longitude <chr>,
## #   `Genome Size  * assembled` <dbl>, `Gene Count  * assembled` <dbl>
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# Install the Bioconductor tree packages only when they are missing
for (pkg in c("treeio", "ggtreeExtra")) {
  if (!requireNamespace(pkg, quietly = TRUE)) BiocManager::install(pkg)
}
library(tidyverse)
library(knitr)
library(ggtree)
library(TDbook) #A Companion Package for the Book "Data Integration, Manipulation and Visualization of Phylogenetic Trees" by Guangchuang Yu (2022, ISBN:9781032233574).
library(ggimage)
library(rphylopic)
library(treeio)
library(tidytree)
library(ape)
library(TreeTools)
library(phytools)
library(ggnewscale)
library(ggtreeExtra)
library(ggstar)
NEON_MAGs <- read_csv("data/NEON/GOLD_Study_ID_Gs0161344_NEON_2024_4_21.csv") %>% 
  # remove columns that are not needed for data analysis
  select(-c(`GOLD Study ID`, `Bin Methods`, `Created By`, `Date Added`, `Bin Lineage`)) %>% 
  # create a new column with the Assembly Type
  mutate("Assembly Type" = case_when(`Genome Name` == "NEON combined assembly" ~ `Genome Name`,
                            TRUE ~ "Individual")) %>% 
  mutate_at("Assembly Type", str_replace, "NEON combined assembly", "Combined") %>% 
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "d__", "") %>%  
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "p__", "") %>% 
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "c__", "") %>% 
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "o__", "") %>% 
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "f__", "") %>% 
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "g__", "") %>% 
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "s__", "") %>%
  separate(`GTDB-Tk Taxonomy Lineage`, c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"), ";", remove = FALSE) %>% 
  mutate_at("Domain", na_if,"") %>% 
  mutate_at("Phylum", na_if,"") %>% 
  mutate_at("Class", na_if,"") %>% 
  mutate_at("Order", na_if,"") %>% 
  mutate_at("Family", na_if,"") %>% 
  mutate_at("Genus", na_if,"") %>% 
  mutate_at("Species", na_if,"") %>% 
  
  # Remove the common prefix "Terrestrial soil microbial communities from "
  mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>% 
  # Split the column in two at the first " - "
  separate(`Genome Name`, c("Site","Sample Name"), " - ") %>% 
  # Remove the common suffix "-comp-1"
  mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
  # separate the Sample Name into Site ID and plot info
  separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE,) %>% 
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-")
## Rows: 1754 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (8): Bin ID, Genome Name, Bin Quality, Bin Lineage, GTDB-Tk Taxonomy L...
## dbl  (10): IMG Genome ID, Bin Completeness, Bin Contamination, Total Number ...
## date  (1): Date Added
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 624 rows [1131, 1132,
## 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145,
## 1146, 1147, 1148, 1149, 1150, ...].
NEON_metagenomes <- read_tsv("data/NEON/exported_img_data_Gs0161344_NEON.tsv") %>% 
  select(-c(`Domain`, `Sequencing Status`, `Sequencing Center`)) %>% 
  rename(`Genome Name` = `Genome Name / Sample Name`) %>% 
  filter(str_detect(`Genome Name`, 're-annotation', negate = T)) %>% 
  filter(str_detect(`Genome Name`, 'WREF plot', negate = T)) 
## Rows: 176 Columns: 46
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (18): Domain, Sequencing Status, Study Name, Genome Name / Sample Name, ...
## dbl (16): taxon_oid, IMG Genome ID, Depth In Meters, Elevation In Meters, Ge...
## lgl (12): Altitude In Meters, Chlorophyll Concentration, Longhurst Code, Lon...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NEON_metagenomes <- NEON_metagenomes %>% 
  # Remove the common prefix "Terrestrial soil microbial communities from "
  mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>% 
  # Split the column in two at the first " - "
  separate(`Genome Name`, c("Site","Sample Name"), " - ") %>% 
  # Remove the common suffix "-comp-1"
  mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
  # separate the Sample Name into Site ID and plot info
  separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE,) %>% 
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-") 
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [53].
NEON_chemistry <- read_tsv("data/NEON/neon_plot_soilChem1_metadata.tsv") %>% 
  # remove -COMP from genomicsSampleID
  mutate_at("genomicsSampleID", str_replace, "-COMP", "")
## Rows: 87 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr   (5): genomicsSampleID, siteID, plotID, nlcdClass, horizon
## dbl  (11): decimalLatitude, decimalLongitude, elevation, soilTemp, d15N, org...
## date  (1): collectionDate
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NEON_MAGs_metagenomes_chemistry <- NEON_MAGs %>% 
  left_join(NEON_metagenomes, by = "Sample Name") %>% 
  left_join(NEON_chemistry, by = c("Sample Name" = "genomicsSampleID")) %>% 
  rename("label" = "Bin ID")
tree_arc <- read.tree("data/NEON/gtdbtk.ar53.decorated.tree")
tree_bac <- read.tree("data/NEON/gtdbtk.bac120.decorated.tree")
# Combine the tip labels and the internal node labels into one vector
node_vector_bac <- c(tree_bac$tip.label, tree_bac$node.label)

# Search for your Phylum or Class to get the node
grep("Methylomirabilota", node_vector_bac, value = TRUE)
## [1] "'0.999:p__Methylomirabilota; c__Methylomirabilia'"
match(grep("Methylomirabilota", node_vector_bac, value = TRUE), node_vector_bac)
## [1] 2651
# Combine the tip labels and the internal node labels into one vector
node_vector_arc <- c(tree_arc$tip.label, tree_arc$node.label)

# Search for your Phylum or Class to get the node
grep("p__", node_vector_arc, value = TRUE)
## [1] "'1.0:p__Halobacteriota; c__Methanosarcinia; o__Methanosarcinales; f__Methanosarcinaceae; g__Methanosarcina'"                                            
## [2] "'1.0:p__Thermoplasmatota; c__E2; o__JACPAO01; f__JAHFTW01'"                                                                                             
## [3] "'1.0:p__Methanobacteriota; c__Methanobacteria; o__Methanobacteriales; f__Methanobacteriaceae; g__Methanobacterium_B; s__Methanobacterium_B sp003151535'"
## [4] "'1.0:p__Thermoproteota'"
match(grep("p__", node_vector_arc, value = TRUE), node_vector_arc)
## [1] 46 50 54 55
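# Preorder the tree before extracting a subtree
tree_bac_preorder <- Preorder(tree_bac)
# Node index taken from the match() result for Methylomirabilota above (2651);
# verify that it still points at this phylum after Preorder()
tree_Methylomirabilota <- Subtree(tree_bac_preorder, 2651)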
NEON_MAGs_Methylomirabilota <- NEON_MAGs_metagenomes_chemistry %>% 
  filter(Phylum == "Methylomirabilota")
NEON_MAGs_metagenomes_chemistry_ONAQ <- NEON_MAGs_metagenomes_chemistry %>% 
  filter(`Site ID.x` == "ONAQ") %>% 
  filter(Domain == "Bacteria")
ONAQ_MAGs_label <- NEON_MAGs_metagenomes_chemistry_ONAQ$label
# Drop every tip that is not an Onaqui MAG
tree_bac_ONAQ_MAGs <- drop.tip(tree_bac, setdiff(tree_bac$tip.label, ONAQ_MAGs_label))
ggtree(tree_bac_ONAQ_MAGs, layout="circular")  %<+%
  NEON_MAGs_metagenomes_chemistry +
  geom_point(mapping=aes(color=Phylum))

This is a phylogenetic tree of the bacterial MAGs found at Onaqui, with each point colored by phylum. These are the same site-specific MAGs, shown in a different visualization. In the tree, most of the phyla sit close to one another, which indicates that they are closely related.
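
Pruning a large tree down to the MAGs from one site can be illustrated on a toy tree. This is a sketch under stated assumptions: the tree and the `bin_*` labels are made up, and `keep.tip()` from ape is used here as one possible way to do the pruning, not necessarily the call used in the report.

```r
library(ape)

# Toy tree with made-up tip labels (illustrative only)
toy_tree <- read.tree(text = "((bin_1,bin_2),(bin_3,(bin_4,bin_5)));")

# Hypothetical list of bins from one site; bin_9 is deliberately not in the tree
site_bins <- c("bin_2", "bin_4", "bin_9")

# Keep only the labels that actually occur in the tree, then prune to them
present   <- intersect(site_bins, toy_tree$tip.label)
site_tree <- keep.tip(toy_tree, present)

site_tree$tip.label
```

Intersecting the labels first avoids an error when a MAG from the table has no matching tip in the tree.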

NEON_MAGs_bact_ind <- NEON_MAGs %>% 
  filter(Domain == "Bacteria") %>% 
  filter(`Assembly Type` == "Individual") 
NEON_utah <- NEON_MAGs_bact_ind %>%
  filter(Site== "Great Basin, Onaqui, Utah, USA")
NEON_utah %>% 
ggplot(aes(x = fct_rev(fct_infreq(Phylum)), fill = Subplot)) +
  geom_bar() +
  coord_flip() +
  labs(title = "MAG Count for each Subplot")

This graph shows the MAG count for each phylum at Onaqui, Great Basin, Utah, USA, with the bars colored by subplot. Actinobacteriota is found in the most subplots and is most frequent in subplot 004.

Question: What is the taxonomic breakdown at Onaqui?

NEON_MAGs_bact_ind %>% 
ggplot(aes(x = fct_rev(fct_infreq(Phylum)), fill = Site)) +
  geom_bar(position = "dodge") +
  coord_flip()

The graph above shows the MAG count for each phylum, with the colors showing how much of the count comes from each site. Methylomirabilota has a very small count across the USA, and the majority of it is in Texas, USA. The Onaqui site has the majority of its count in Actinobacteriota.

NEON_MAGs_bact_ind %>% 
ggplot(aes(x = Phylum)) +
  geom_bar(position = position_dodge2(width = 0.9, preserve = "single")) +
  coord_flip() +
  facet_wrap(vars(Site), scales = "free", ncol = 2)

This figure gives a closer look at each site shown in the previous graph. At our site, Onaqui, there is a sharp drop in count between Actinobacteriota and the other phyla.

[Figure: Phylum Count]

This figure shows the phylum counts at Onaqui. Actinobacteriota has the highest count at this location and Desulfobacterota_B has the lowest. The phylum Methylomirabilota is not found.

[Figure: Order in Phylum]

This figure shows the count of each order within each phylum found at Onaqui. Actinobacteriota has the largest variety of orders, with the largest count coming from the Entotheonellales order.

NEON_utah %>% 
  ggplot(aes(x = fct_rev(fct_infreq(Order)), fill = `Family`)) + geom_bar() + coord_flip() + 
  labs(title = "Family in each order", y = "Count", x = "Order")

This graph breaks down the families within each order present at Onaqui. The families WHSQ01, 70-9 and Solirubrobacteraceae have the highest overall counts, specifically in the orders CADDZG01 and Solirubrobacterales.

Question: Are there any novel bacteria found in Onaqui?

NEON_MAGs_bact_ind %>% 
  filter(is.na(Class) | is.na(Order) | is.na(Genus) | is.na(Family) | is.na(Phylum) | is.na(Domain))%>%
ggplot(aes(x = fct_infreq(Site))) +
  geom_bar() +
  coord_flip()

This graph shows the count of novel bacteria found at each site, where a MAG is counted as novel if at least one taxonomic rank from domain to genus is unassigned (NA). Our selected site, Great Basin, Onaqui, Utah, USA, has a count of about 28 novel bacteria.
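
The idea of flagging a MAG as novel when any taxonomic rank is missing can be sketched on a small made-up table. The rows, site names and the `novel_counts` object below are hypothetical and serve only to show the filtering logic; `if_any()` is used here as a compact alternative to chaining `is.na()` checks.

```r
library(tibble)
library(dplyr)

# Hypothetical MAG table (made-up rows, for illustration only)
toy_mags <- tibble(
  Site   = c("Site A", "Site A", "Site B"),
  Phylum = c("Actinobacteriota", "Actinobacteriota", "Methylomirabilota"),
  Class  = c("Thermoleophilia", "Thermoleophilia", NA),
  Genus  = c("Gaiella", NA, NA)
)

novel_counts <- toy_mags %>%
  # Keep rows where at least one rank is unassigned
  filter(if_any(c(Phylum, Class, Genus), is.na)) %>%
  # Count the flagged MAGs per site
  count(Site, name = "novel_MAGs")

novel_counts
```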

Question: What is the correlation of site/ecosystem subtype to soil temperature and soil pH?

[Figure: Soil Temperature for each Site]

This graph shows the soil temperature for each site. The box plot on the far left is our site, which has a soil temperature range of about 11 to 16 °C.
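
A range like the one read off the box plot can also be computed directly. The sketch below uses made-up temperatures and a hypothetical second site code (`XXXX`); only the column names `siteID` and `soilTemp` are taken from the soil chemistry file used in this report.

```r
library(tibble)
library(dplyr)

# Made-up soil temperatures (illustrative values, not NEON measurements)
toy_chem <- tibble(
  siteID   = c("ONAQ", "ONAQ", "ONAQ", "XXXX", "XXXX"),
  soilTemp = c(11.2, 13.5, 16.0, 20.1, NA)
)

temp_range <- toy_chem %>%
  group_by(siteID) %>%
  summarise(
    min_temp = min(soilTemp, na.rm = TRUE),
    max_temp = max(soilTemp, na.rm = TRUE),
    n_values = sum(!is.na(soilTemp)),   # number of non-missing readings
    .groups  = "drop"
  )

temp_range
```

Reporting the number of non-missing readings next to the range shows how many samples each range rests on.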

NEON_MAGs <- read_csv("data/NEON/GOLD_Study_ID_Gs0161344_NEON_edArchaea.csv") %>% 
  # remove columns that are not needed for data analysis
  select(-c(`GOLD Study ID`, `Bin Methods`, `Created By`, `Date Added`)) %>% 
  # create a new column with the Assembly Type
  mutate("Assembly Type" = case_when(`Genome Name` == "NEON combined assembly" ~ `Genome Name`,
                            TRUE ~ "Individual")) %>% 
  mutate_at("Assembly Type", str_replace, "NEON combined assembly", "Combined") %>% 
  separate(`GTDB-Tk Taxonomy Lineage`, c("Domain", "Phylum", "Class", "Order", "Family", "Genus"), "; ", remove = FALSE) %>% 
  # Remove the common prefix "Terrestrial soil microbial communities from "
  mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>% 
  # Split the column in two at the first " - "
  separate(`Genome Name`, c("Site","Sample Name"), " - ") %>% 
  # Remove the common suffix "-comp-1"
  mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
  # separate the Sample Name into Site ID and plot info
  separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE,) %>% 
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-")
## Rows: 1754 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (8): Bin ID, Genome Name, Bin Quality, Bin Lineage, GTDB-Tk Taxonomy L...
## dbl  (10): IMG Genome ID, Bin Completeness, Bin Contamination, Total Number ...
## date  (1): Date Added
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: Expected 6 pieces. Additional pieces discarded in 46 rows [3, 4, 24, 25, 26,
## 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 54, 232, 267, ...].
## Warning: Expected 6 pieces. Missing pieces filled with `NA` in 446 rows [1, 2, 9, 10,
## 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 46, 50, 53, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 624 rows [4, 7, 8, 236,
## 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252,
## ...].
NEON_metagenomes <- read_tsv("data/NEON/exported_img_data_Gs0161344_NEON.tsv") %>% 
  rename(`Genome Name` = `Genome Name / Sample Name`) %>% 
  filter(str_detect(`Genome Name`, 're-annotation', negate = T)) %>% 
  filter(str_detect(`Genome Name`, 'WREF plot', negate = T))
## Rows: 176 Columns: 46
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (18): Domain, Sequencing Status, Study Name, Genome Name / Sample Name, ...
## dbl (16): taxon_oid, IMG Genome ID, Depth In Meters, Elevation In Meters, Ge...
## lgl (12): Altitude In Meters, Chlorophyll Concentration, Longhurst Code, Lon...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NEON_metagenomes <- NEON_metagenomes %>% 
  # Remove the common prefix "Terrestrial soil microbial communities from "
  mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>% 
  # Split the column in two at the first " - "
  separate(`Genome Name`, c("Site","Sample Name"), " - ") %>% 
  # Remove the common suffix "-comp-1"
  mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
  # separate the Sample Name into Site ID and plot info
  separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE,) %>% 
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-") 
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [53].
NEON_chemistry <- read_tsv("data/NEON/neon_plot_soilChem1_metadata.tsv") %>% 
  # remove -COMP from genomicsSampleID
  mutate_at("genomicsSampleID", str_replace, "-COMP", "")
## Rows: 87 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr   (5): genomicsSampleID, siteID, plotID, nlcdClass, horizon
## dbl  (11): decimalLatitude, decimalLongitude, elevation, soilTemp, d15N, org...
## date  (1): collectionDate
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
kable(
  NEON_chemistry_description <- read_tsv("data/NEON/neon_soilChem1_metadata_descriptions.tsv") 
)
## Rows: 23 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): fieldName, description, dataType, units
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
| fieldName | description | dataType | units |
|---|---|---|---|
| siteID | NEON site code | string | NA |
| plotID | Plot identifier (NEON site code_XXX) | string | NA |
| sampleID | Identifier for sample | string | NA |
| horizon | Organic or mineral soil | string | NA |
| genomicsSampleID | Identifier for a genomics sample | string | NA |
| d15N | Measure of the ratio of 15N:14N in a sample, relative to atmospheric N2 | real | permill |
| organicd13C | Measure of the ratio of 13C:12C in soil organic carbon, relative to Vienna Pee Dee Belemnite | real | permill |
| nitrogenPercent | Percent nitrogen in a sample on a dry weight basis | real | percent |
| organicCPercent | Percent organic carbon in a sample on a dry weight basis | real | percent |
| CNratio | Ratio of carbon to nitrogen concentration in a sample on a dry weight basis | real | NA |
| nlcdClass | National Land Cover Database Vegetation Type Name | string | NA |
| subplotID | Identifier for the NEON subplot | string | NA |
| coreCoordinateX | x location of the soil core relative to the SW corner | real | meter |
| coreCoordinateY | y location of the soil core relative to the SW corner | real | meter |
| decimalLatitude | The geographic latitude (in decimal degrees, WGS84) of the geographic center of the reference area | real | decimalDegree |
| decimalLongitude | The geographic longitude (in decimal degrees, WGS84) of the geographic center of the reference area | real | decimalDegree |
| elevation | Elevation (in meters) above sea level | real | meter |
| sampleTiming | Timing of the sampling event with regard to the field season | string | NA |
| soilTemp | In-situ temperature of soil at approximately 10 cm depth | real | degree |
| sampleTopDepth | Depth below the soil surface of the top of a soil sample | real | centimeter |
| sampleBottomDepth | Depth below the soil surface of the bottom of a soil sample | real | centimeter |
| soilInWaterpH | pH value of soil measured in water solution | real | pH |
| soilInCaClpH | pH value of soil measured in calcium chloride solution | real | pH |
NEON_MAGs_column <- NEON_MAGs %>%
  select("Sample Name","Site ID","GTDB-Tk Taxonomy Lineage" )
NEON_metagenomes_column <- NEON_metagenomes %>%
  select("Sample Name","Site ID","Ecosystem Subtype")
NEON_chemistry_column <- NEON_chemistry %>%
  select("genomicsSampleID","siteID","nlcdClass")
NEON_MAGS_site <- NEON_MAGs_column %>%
  filter(`Site ID` == "ONAQ")
NEON_metagenomes_site <- NEON_metagenomes_column %>%
  filter(`Site ID` == "ONAQ")
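NEON_chemistry_site <- NEON_chemistry_column %>%
  # the chemistry table names its site column siteID; a quoted 'Site ID' is a string literal, not a column
  filter(siteID == "ONAQ")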
NEON_MAGS_site %>% 
  left_join(NEON_metagenomes_site, by = "Sample Name")
## # A tibble: 6 × 5
##   `Sample Name`       `Site ID.x` `Ecosystem Subtype.x` `Site ID.y`
##   <chr>               <chr>       <chr>                 <chr>      
## 1 ONAQ_004-M-20210525 ONAQ        Shrubland             ONAQ       
## 2 ONAQ_010-M-20210526 ONAQ        Shrubland             ONAQ       
## 3 ONAQ_008-M-20210524 ONAQ        Shrubland             ONAQ       
## 4 ONAQ_002-M-20210524 ONAQ        Shrubland             ONAQ       
## 5 ONAQ_005-M-20210527 ONAQ        Shrubland             ONAQ       
## 6 ONAQ_003-M-20210527 ONAQ        Shrubland             ONAQ       
## # ℹ 1 more variable: `Ecosystem Subtype.y` <chr>
NEON_metagenomes_site %>% 
  left_join(NEON_chemistry_site, by = c("Sample Name" = "genomicsSampleID"))
## # A tibble: 6 × 5
##   `Sample Name`       `Site ID` `Ecosystem Subtype` siteID nlcdClass
##   <chr>               <chr>     <chr>               <chr>  <chr>    
## 1 ONAQ_004-M-20210525 ONAQ      Shrubland           <NA>   <NA>     
## 2 ONAQ_010-M-20210526 ONAQ      Shrubland           <NA>   <NA>     
## 3 ONAQ_008-M-20210524 ONAQ      Shrubland           <NA>   <NA>     
## 4 ONAQ_002-M-20210524 ONAQ      Shrubland           <NA>   <NA>     
## 5 ONAQ_005-M-20210527 ONAQ      Shrubland           <NA>   <NA>     
## 6 ONAQ_003-M-20210527 ONAQ      Shrubland           <NA>   <NA>
 NEON_metagenomes_site %>% 
  left_join(NEON_chemistry_site, by = c("Site ID" = "siteID"))
## # A tibble: 6 × 5
##   `Sample Name`       `Site ID` `Ecosystem Subtype` genomicsSampleID nlcdClass
##   <chr>               <chr>     <chr>               <chr>            <chr>    
## 1 ONAQ_004-M-20210525 ONAQ      Shrubland           <NA>             <NA>     
## 2 ONAQ_010-M-20210526 ONAQ      Shrubland           <NA>             <NA>     
## 3 ONAQ_008-M-20210524 ONAQ      Shrubland           <NA>             <NA>     
## 4 ONAQ_002-M-20210524 ONAQ      Shrubland           <NA>             <NA>     
## 5 ONAQ_005-M-20210527 ONAQ      Shrubland           <NA>             <NA>     
## 6 ONAQ_003-M-20210527 ONAQ      Shrubland           <NA>             <NA>
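The joins above fill `siteID`, `genomicsSampleID` and `nlcdClass` with `NA` in every row, which suggests that no key in the left table found a partner in the right table. The base-R sketch below shows two checks that may help diagnose this kind of result. The small data frames are made-up stand-ins for the NEON tables, and the `nlcdClass` values are illustrative only.

```r
# Toy stand-ins for the metagenome and chemistry tables (values are illustrative)
metagenomes <- data.frame(
  sample_name = c("ONAQ_004-M-20210525", "ONAQ_010-M-20210526"),
  site_id = c("ONAQ", "ONAQ"),
  stringsAsFactors = FALSE
)
chemistry <- data.frame(
  genomicsSampleID = c("ONAQ_004-M-20210525", "ONAQ_010-M-20210526"),
  siteID = c("ONAQ", "ONAQ"),
  nlcdClass = c("shrubScrub", "shrubScrub"),
  stringsAsFactors = FALSE
)

# Check 1: a quoted name is a string literal, not a column.
# "Site ID" == "ONAQ" compares two fixed strings and is always FALSE,
# so a filter built on it keeps no rows.
literal_test <- "Site ID" == "ONAQ"
rows_kept_literal <- sum(rep(literal_test, nrow(chemistry)))

# Referring to the column itself keeps the ONAQ rows.
rows_kept_column <- sum(chemistry$siteID == "ONAQ")

# Check 2: list the left-hand keys that have no partner on the right.
unmatched <- setdiff(metagenomes$sample_name, chemistry$genomicsSampleID)

cat("rows kept with quoted literal:", rows_kept_literal, "\n")
cat("rows kept with column reference:", rows_kept_column, "\n")
cat("unmatched keys:", length(unmatched), "\n")
```

If the right-hand table is empty before the join, every left-hand key is unmatched and a left join can only return `NA` in the added columns.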
Table_7 <- NEON_metagenomes %>%
  full_join(NEON_MAGs, by = "Sample Name")
Table_8 <- Table_7 %>%
  full_join(NEON_chemistry, by = c("Sample Name" = "genomicsSampleID"))
Table_9 <- Table_8 %>%
  filter(str_detect(`GTDB-Tk Taxonomy Lineage`, "Methylomirabilota"))
Table_9 %>%
  ggplot(aes(x = fct_infreq(`Ecosystem Subtype`), y = soilTemp, color = Order)) + 
  geom_point() +
  coord_flip()
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).

This graph shows, for each ecosystem subtype, which order is present and at what soil temperature. At the bottom, the grasslands contain several Rokubacteriales points at different soil temperatures.

Table_9 %>%
  ggplot(aes(x = fct_infreq(`nlcdClass`), y = soilInCaClpH, color = Family)) + 
  geom_point() +
  coord_flip()
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).

This graph shows the land-cover class (nlcdClass) of each sample and the soil pH, measured in calcium chloride solution, at which each family is found. The CSP1-6 family occurs in several ecosystem types and across a range of soil pH values. The 2-02-FULL-66-22 family, however, is found only in the emergent herbaceous wetlands, at a soil pH of ~7.

Question: What are all the taxa at Onaqui?

Figure: Taxa at Onaqui

This graph shows all of the taxa that are present at this site.

Discussion of Results

The site observed is the terrestrial site Great Basin, Onaqui, Utah, USA. With the data collected through NEON, we were able to determine the taxonomic breakdown of this location as well as its soil temperature and pH. The taxonomic breakdown includes the phylum count, the orders within each phylum, the family count, and the MAG count by class. The data show that the most abundant phylum at Onaqui is Actinobacteriota, which contains several orders and reaches a total count of 35. The subplot graph displays the count of the orders present in each subplot of the site. Subplot 004 has the highest MAG count (~30) and contains almost all of the 14 classes found across the ONAQ site. The family graph shows that WHSQ-01 has the highest count at this site (5), followed by 70-9 (4). From the NEON data we were also able to visualize the soil temperature and pH at this site. The soil temperature at Onaqui ranges from ~12° to 16°. The median soil temperature lies close to the third quartile, just below 15°. This range overlaps with the soil temperatures of many other sites studied through NEON. The soil pH graph shows which family is present at each soil pH: the family CSP1-6 is found at soil pH values from 5 to 7, while the family 2-02-FULL-66-22 is present only at a soil pH of ~7.

Phylum Specific

Question: Where in the US is the phylum Methylomirabilota found? Is Methylomirabilota found at our selected site?

NEON_MAGS_table <- NEON_MAGs_bact_ind %>%
  filter(Phylum=='Methylomirabilota')
datatable(
  NEON_MAGS_table %>%
  count(Site, sort = TRUE))

This table shows the 5 sites at which Methylomirabilota is found, along with the count at each site. The highest count of our phylum is at National Grasslands LBJ, Texas, USA. Methylomirabilota is not found at our selected site, Onaqui, Utah.
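As a rough illustration of what the per-site tally computes, the base-R sketch below counts entries per site and sorts them from most to least frequent. The site labels and counts are made up and do not reproduce the NEON results.

```r
# Toy vector of site labels, one entry per MAG (labels and counts are made up)
mag_sites <- c("Site A", "Site B", "Site A", "Site C", "Site A", "Site B")

# Tally MAGs per site and order from most to least frequent
site_counts <- sort(table(mag_sites), decreasing = TRUE)

# Convert to a data frame with explicit column names
site_table <- data.frame(
  Site = names(site_counts),
  n = as.integer(site_counts),
  stringsAsFactors = FALSE
)
print(site_table)
```

A site that is absent from the input simply does not appear in the result, which is how the absence of a site can be read off such a table.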

NEON_MAGS_table %>% 
  ggplot(aes(x = fct_rev(fct_infreq(Site)), fill = `Genus`)) + geom_bar() +
  coord_flip() +
  labs(title = "Genus at each site", y= "Count", x = "Site")

This graph shows the genera present at each site. Genus is one of the lower ranks in the taxonomic hierarchy. The graph shows which genera of the phylum Methylomirabilota occur at which site. The site with the most genera is the Texas site, where the genus AR12 is the most abundant.

Question: Where is each order found?

Figure: Order of Phylum per Site

This shows the orders of the phylum Methylomirabilota found at each site. Four of the sites contain Rokubacteriales, while one site contains Methylomirabilales.

Question: What are the individual assemblies and co-assemblies for Methylomirabilota?

Figure: Co-Assembly

This is the co-assembly of Methylomirabilota at our site.

Figure: Individual Assembly

This is the individual assembly of Methylomirabilota at our site.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("treeio")
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## 'help("repositories", package = "BiocManager")' for details.
## Replacement repositories:
##     CRAN: http://rspm/default/__linux__/focal/latest
## Bioconductor version 3.18 (BiocManager 1.30.23), R 4.3.3 (2024-02-29)
## Warning: package(s) not installed when version(s) same as or greater than current; use
##   `force = TRUE` to re-install: 'treeio'
## Installation paths not writeable, unable to update packages
##   path: /opt/R/4.3.3/lib/R/library
##   packages:
##     boot, codetools, lattice, survival
BiocManager::install("ggtreeExtra")
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## 'help("repositories", package = "BiocManager")' for details.
## Replacement repositories:
##     CRAN: http://rspm/default/__linux__/focal/latest
## Bioconductor version 3.18 (BiocManager 1.30.23), R 4.3.3 (2024-02-29)
## Warning: package(s) not installed when version(s) same as or greater than current; use
##   `force = TRUE` to re-install: 'ggtreeExtra'
## Installation paths not writeable, unable to update packages
##   path: /opt/R/4.3.3/lib/R/library
##   packages:
##     boot, codetools, lattice, survival
library(tidyverse)
library(knitr)
library(ggtree)
library(TDbook) #A Companion Package for the Book "Data Integration, Manipulation and Visualization of Phylogenetic Trees" by Guangchuang Yu (2022, ISBN:9781032233574).
library(ggimage)
library(rphylopic)
library(treeio)
library(tidytree)
library(ape)
library(TreeTools)
library(phytools)
library(ggnewscale)
library(ggtreeExtra)
library(ggstar)
NEON_MAGs <- read_csv("data/NEON/GOLD_Study_ID_Gs0161344_NEON_2024_4_21.csv") %>% 
  # remove columns that are not needed for data analysis
  select(-c(`GOLD Study ID`, `Bin Methods`, `Created By`, `Date Added`, `Bin Lineage`)) %>% 
  # create a new column with the Assembly Type
  mutate("Assembly Type" = case_when(`Genome Name` == "NEON combined assembly" ~ `Genome Name`,
                            TRUE ~ "Individual")) %>% 
  mutate_at("Assembly Type", str_replace, "NEON combined assembly", "Combined") %>% 
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "d__", "") %>%  
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "p__", "") %>% 
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "c__", "") %>% 
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "o__", "") %>% 
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "f__", "") %>% 
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "g__", "") %>% 
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "s__", "") %>%
  separate(`GTDB-Tk Taxonomy Lineage`, c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"), ";", remove = FALSE) %>% 
  mutate_at("Domain", na_if,"") %>% 
  mutate_at("Phylum", na_if,"") %>% 
  mutate_at("Class", na_if,"") %>% 
  mutate_at("Order", na_if,"") %>% 
  mutate_at("Family", na_if,"") %>% 
  mutate_at("Genus", na_if,"") %>% 
  mutate_at("Species", na_if,"") %>% 
  
  # Get rid of the common string "Terrestrial soil microbial communities from "
  mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>% 
  # Split the column in two at the " - " separator
  separate(`Genome Name`, c("Site","Sample Name"), " - ") %>% 
  # Get rid of the common string "-comp-1"
  mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
  # separate the Sample Name into Site ID and plot info
  separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE) %>% 
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-")
## Rows: 1754 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (8): Bin ID, Genome Name, Bin Quality, Bin Lineage, GTDB-Tk Taxonomy L...
## dbl  (10): IMG Genome ID, Bin Completeness, Bin Contamination, Total Number ...
## date  (1): Date Added
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 624 rows [1131, 1132,
## 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145,
## 1146, 1147, 1148, 1149, 1150, ...].
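The taxonomy clean-up above strips the rank prefixes and splits the lineage on `;`. A minimal base-R sketch of the same idea for one lineage string is shown below. The helper `split_lineage` is hypothetical, and the example lineage is illustrative, assembled from taxon names that appear in this report.

```r
# Hypothetical helper: split one GTDB-Tk lineage string into named ranks.
split_lineage <- function(lineage) {
  ranks <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
  # Split on ";" and strip the one-letter rank prefixes such as "p__"
  parts <- strsplit(lineage, ";", fixed = TRUE)[[1]]
  parts <- sub("^[a-z]__", "", parts)
  # Pad with NA when lower ranks are missing, and turn empty strings into NA
  length(parts) <- length(ranks)
  parts[!is.na(parts) & parts == ""] <- NA
  stats::setNames(parts, ranks)
}

# Illustrative lineage with unassigned genus and species
lineage <- "d__Bacteria;p__Methylomirabilota;c__Methylomirabilia;o__Rokubacteriales;f__CSP1-6;g__;s__"
ranks <- split_lineage(lineage)
print(ranks)
```

Converting empty ranks to `NA` keeps unassigned genera and species from being counted as a taxon named "".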
NEON_metagenomes <- read_tsv("data/NEON/exported_img_data_Gs0161344_NEON.tsv") %>% 
  select(-c(`Domain`, `Sequencing Status`, `Sequencing Center`)) %>% 
  rename(`Genome Name` = `Genome Name / Sample Name`) %>% 
  filter(str_detect(`Genome Name`, 're-annotation', negate = T)) %>% 
  filter(str_detect(`Genome Name`, 'WREF plot', negate = T)) 
## Rows: 176 Columns: 46
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (18): Domain, Sequencing Status, Study Name, Genome Name / Sample Name, ...
## dbl (16): taxon_oid, IMG Genome ID, Depth In Meters, Elevation In Meters, Ge...
## lgl (12): Altitude In Meters, Chlorophyll Concentration, Longhurst Code, Lon...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NEON_metagenomes <- NEON_metagenomes %>% 
  # Get rid of the common string "Terrestrial soil microbial communities from "
  mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>% 
  # Split the column in two at the " - " separator
  separate(`Genome Name`, c("Site","Sample Name"), " - ") %>% 
  # Get rid of the common string "-comp-1"
  mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
  # separate the Sample Name into Site ID and plot info
  separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE) %>% 
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-") 
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [53].
NEON_chemistry <- read_tsv("data/NEON/neon_plot_soilChem1_metadata.tsv") %>% 
  # remove -COMP from genomicsSampleID
  mutate_at("genomicsSampleID", str_replace, "-COMP", "")
## Rows: 87 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr   (5): genomicsSampleID, siteID, plotID, nlcdClass, horizon
## dbl  (11): decimalLatitude, decimalLongitude, elevation, soilTemp, d15N, org...
## date  (1): collectionDate
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NEON_MAGs_metagenomes_chemistry <- NEON_MAGs %>% 
  left_join(NEON_metagenomes, by = "Sample Name") %>% 
  left_join(NEON_chemistry, by = c("Sample Name" = "genomicsSampleID")) %>% 
  rename("label" = "Bin ID")
tree_arc <- read.tree("data/NEON/gtdbtk.ar53.decorated.tree")
tree_bac <- read.tree("data/NEON/gtdbtk.bac120.decorated.tree")
# Make a vector with the internal node labels
node_vector_bac = c(tree_bac$tip.label,tree_bac$node.label)

# Search for your Phylum or Class to get the node
grep("Methylomirabilota", node_vector_bac, value = TRUE)
## [1] "'0.999:p__Methylomirabilota; c__Methylomirabilia'"
match(grep("Methylomirabilota", node_vector_bac, value = TRUE), node_vector_bac)
## [1] 2651
# Make a vector with the internal node labels
node_vector_arc = c(tree_arc$tip.label,tree_arc$node.label)

# Search for your Phylum or Class to get the node
grep("p__", node_vector_arc, value = TRUE)
## [1] "'1.0:p__Halobacteriota; c__Methanosarcinia; o__Methanosarcinales; f__Methanosarcinaceae; g__Methanosarcina'"                                            
## [2] "'1.0:p__Thermoplasmatota; c__E2; o__JACPAO01; f__JAHFTW01'"                                                                                             
## [3] "'1.0:p__Methanobacteriota; c__Methanobacteria; o__Methanobacteriales; f__Methanobacteriaceae; g__Methanobacterium_B; s__Methanobacterium_B sp003151535'"
## [4] "'1.0:p__Thermoproteota'"
match(grep("p__", node_vector_arc, value = TRUE), node_vector_arc)
## [1] 46 50 54 55
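The lookups above rely on how a `phylo` object is numbered: tips are numbered 1 to n, and internal nodes follow from n + 1 in the order of `node.label`. Concatenating the two label vectors therefore gives a vector whose positions are node numbers. The sketch below illustrates this with a plain list standing in for a `phylo` object, so no packages are needed. The labels are made up, apart from the Methylomirabilota label, which copies the format printed above.

```r
# Toy stand-in for a phylo object: only the two label vectors are needed here.
toy_tree <- list(
  tip.label = c("bin_1", "bin_2", "bin_3", "bin_4"),
  node.label = c("", "'1.0:p__Actinobacteriota'",
                 "'0.999:p__Methylomirabilota; c__Methylomirabilia'")
)

# Positions in the combined vector correspond to node numbers:
# tips come first (1..n), internal nodes follow (n+1, n+2, ...).
node_vector <- c(toy_tree$tip.label, toy_tree$node.label)

# grep() returns the matching positions directly, so the grep(value = TRUE)
# plus match() pair can be collapsed into one call.
node_id <- grep("Methylomirabilota", node_vector, fixed = TRUE)
print(node_id)
```

Node numbers may change when a tree is reordered, so it can be worth repeating the lookup on the exact tree object that is later passed to the subtree extraction.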
# First need to preorder tree before extracting. N
tree_bac_preorder <- Preorder(tree_bac)
tree_Methylomirabilota <- Subtree(tree_bac_preorder, 1712)
NEON_MAGs_Methylomirabilota <- NEON_MAGs_metagenomes_chemistry %>% 
  filter(Phylum == "Methylomirabilota")
ggtree(tree_Methylomirabilota, layout="circular")  %<+%
  NEON_MAGs_metagenomes_chemistry + 
  geom_tiplab(size=2, hjust=-.1) +
  xlim(0,20) +
  geom_point(mapping=aes(color=Class, shape = `Assembly Type`))
## Warning: Removed 46 rows containing missing values or values outside the scale range
## (`geom_point()`).

This phylogenetic tree shows the classes within the phylum Methylomirabilota and whether each MAG comes from a co-assembly or an individual assembly. It is another way to look at the data shown in the Sankey plots.

NEON_MAGs_metagenomes_chemistry_noblank <- NEON_MAGs_metagenomes_chemistry %>% 
  rename("AssemblyType" = "Assembly Type") %>% 
  rename("BinCompleteness" = "Bin Completeness") %>% 
  rename("BinContamination" = "Bin Contamination") %>% 
  rename("TotalNumberofBases" = "Total Number of Bases") %>% 
  rename("EcosystemSubtype" = "Ecosystem Subtype")
ggtree(tree_Methylomirabilota, layout="circular", branch.length="none") %<+% 
  NEON_MAGs_metagenomes_chemistry + 
  geom_point2(mapping=aes(color=`Ecosystem Subtype`, size=`Total Number of Bases`)) + 
  new_scale_fill() + 
  geom_fruit(
      data=NEON_MAGs_metagenomes_chemistry_noblank,
      geom=geom_tile,
      mapping=aes(y=label, x=1, fill= AssemblyType),
      offset=0.08,   # The distance between external layers, default is 0.03 times of x range of tree.
      pwidth=0.25 # width of the external layer, default is 0.2 times of x range of tree.
      ) + 
  new_scale_fill() +
  geom_fruit(
          data=NEON_MAGs_metagenomes_chemistry_noblank,
          geom=geom_col,
          mapping=aes(y=label, x=TotalNumberofBases),  
          pwidth=0.4,
          axis.params=list(
                          axis="x", # add axis text of the layer.
                          text.angle=-45, # the angle of the axis text.
                          hjust=0  # adjust the horizontal position of text of axis.
                      ),
          grid.params=list() # add the grid line of the external bar plot.
      ) + 
      theme(#legend.position=c(0.96, 0.5), # the position of legend.
          legend.background=element_rect(fill=NA), # the background of legend.
          legend.title=element_text(size=7), # the title size of legend.
          legend.text=element_text(size=6), # the text size of legend.
          legend.spacing.y = unit(0.02, "cm")  # the distance of legends (y orientation).
      )
## Warning: Removed 46 rows containing missing values or values outside the scale range
## (`geom_point_g_gtree()`).

This figure extends the previous phylogenetic tree. It adds the ecosystem subtype, the assembly type (individual or co-assembly), and the total number of bases.

Question: What is the soil temperature range for this phylum?

Figure: Soil temp per Phylum

This shows the soil temperatures at which each phylum is found. Methylomirabilota has a narrower temperature range than the other phyla, from ~14° to 24°.
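One way to back the comparison of temperature ranges with numbers is to compute the minimum, maximum and span of soil temperature per phylum. The base-R sketch below does this on made-up values. The phylum labels and temperatures are illustrative and are not taken from the NEON data.

```r
# Toy data: soil temperature per MAG (values are made up for illustration)
mags <- data.frame(
  Phylum = c("Phylum A", "Phylum A", "Phylum A", "Phylum B", "Phylum B"),
  soilTemp = c(5.0, 18.5, 30.0, 14.0, 24.0),
  stringsAsFactors = FALSE
)

# Minimum and maximum soil temperature for each phylum, ignoring missing values
temp_min <- tapply(mags$soilTemp, mags$Phylum, min, na.rm = TRUE)
temp_max <- tapply(mags$soilTemp, mags$Phylum, max, na.rm = TRUE)

# Collect the results, including the width of each range
temp_range <- data.frame(
  Phylum = names(temp_min),
  min_temp = as.numeric(temp_min),
  max_temp = as.numeric(temp_max),
  span = as.numeric(temp_max - temp_min),
  stringsAsFactors = FALSE
)
print(temp_range)
```

Sorting such a table by `span` would show directly which phylum covers the narrowest range of soil temperatures.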

Question: What are the ecosystem sub types for Methylomirabilota?

NEON_MAGs_metagenomes_chemistry_noblank <- NEON_MAGs_metagenomes_chemistry %>% 
  rename("AssemblyType" = "Assembly Type") %>% 
  rename("BinCompleteness" = "Bin Completeness") %>% 
  rename("BinContamination" = "Bin Contamination") %>% 
  rename("TotalNumberofBases" = "Total Number of Bases") %>% 
  rename("EcosystemSubtype" = "Ecosystem Subtype")

ggtree(tree_Methylomirabilota)  %<+%
  NEON_MAGs_metagenomes_chemistry + 
  geom_tippoint(aes(colour=`Ecosystem Subtype`)) + 

# For unknown reasons the following does not like blank spaces in the names
  geom_facet(panel = "Bin Completeness", data = NEON_MAGs_metagenomes_chemistry_noblank, geom = geom_point, 
      mapping=aes(x = BinCompleteness)) +
  geom_facet(panel = "Bin Contamination", data = NEON_MAGs_metagenomes_chemistry_noblank, geom = geom_col, 
                aes(x = BinContamination), orientation = 'y', width = .6) +
  theme_tree2(legend.position=c(.1, .7))

This image combines three panels. The leftmost panel shows the tree, with each tip colored by the ecosystem subtype in which Methylomirabilota was found. The other two panels show the bin completeness and bin contamination of each MAG. Both values are based on the marker genes found in each bin.

Discussion of Results

As mentioned before, Methylomirabilota, also known as NC10, is a bacterial phylum. Within the NEON data set, 23 MAGs belong to the Methylomirabilota phylum. Of these 23, 16 come from individual assemblies and 7 come from the NEON combined assembly. This lab mostly focuses on the 16 MAGs from the individual assemblies.

As seen in the individual assemblies above, all 16 MAGs belong to the class Methylomirabilia. Further down the taxonomy, 15 of the 16 belong to the order Rokubacteriales, while 1 belongs to the order Methylomirabilales.

Beyond the taxonomic breakdown, the Methylomirabilota phylum was collected at 5 different NEON sites, with the majority of the MAGs coming from the National Grasslands LBJ and Konza Prairie BioStation sites. Six genera are found across those sites. The two orders, Methylomirabilales and Rokubacteriales, were also found there: Rokubacteriales at 4 sites and Methylomirabilales at only one. The soil temperature at which this phylum was found ranges from 14° to 24°. This range is much narrower than the temperature ranges of the other phyla.

Discussion

We used the results obtained from the NEON data set and the visuals above to draw conclusions about the site and phylum we highlighted. In the individual assemblies, we found no evidence of our phylum at the site. The co-assembly, however, shows a small presence of the phylum Methylomirabilota. We were also able to learn which phyla are prominent in the Onaqui ecosystem. These include Actinobacteriota, Chloroflexota and Proteobacteria. We broke this down further to see which orders and families were found. The average soil temperature at this site is much lower than the average soil temperature at which Methylomirabilota is found, which could be one of the reasons it is not found there. By understanding Methylomirabilota, its characteristics and functions, and the environments in which it is found, we can better understand its importance to the ecosystems in which it is most abundant. Additionally, by gathering data from the site, we can analyze how the bacterial phyla found at Onaqui shape the ecosystem as it is today.

Conclusion

We thank NEON for creating useful, clear and important data that we can analyze further. Understanding these phyla and sites, specifically Onaqui and Methylomirabilota, helps us better grasp the ecosystems in the USA. Working with the data collected by NEON, we were able to produce graphs and tables that display the data visually. With these, we were able to draw conclusions about what is present, and perhaps even why that is. Much more data is available from NEON that can help us further understand the different ecosystems in the US. In the future, with our data and findings, we can continue to learn more about how our ecosystems are changing and how we can adapt to them and protect them so that they remain healthy and stable.

References

(He et al. 2016) (“Onaqui NEON NSF NEON Open Data to Understand Our Ecosystems” n.d.) (“DOE Joint Genome Institute: A DOE Office of Science User Facility of Lawrence Berkeley National Laboratory” n.d.) (Clum et al. n.d.) (Baxter 2018) (Holthuijzen and Veblen 2015)

Baxter, Bonnie K. 2018. “Great Salt Lake Microbiology: A Historical Perspective.” International Microbiology: The Official Journal of the Spanish Society for Microbiology 21 (3): 79–95. https://doi.org/10.1007/s10123-018-0008-z.
Clum, Alicia, Marcel Huntemann, Brian Bushnell, Brian Foster, Bryce Foster, Simon Roux, Patrick P. Hajek, et al. n.d. “DOE JGI Metagenome Workflow.” mSystems 6 (3): e00804–20. Accessed May 8, 2024. https://doi.org/10.1128/mSystems.00804-20.
“DOE Joint Genome Institute: A DOE Office of Science User Facility of Lawrence Berkeley National Laboratory.” n.d. DOE Joint Genome Institute. Accessed May 8, 2024. https://jgi.doe.gov/.
He, Zhanfei, Chaoyang Cai, Jiaqi Wang, Xinhua Xu, Ping Zheng, Mike S. M. Jetten, and Baolan Hu. 2016. “A Novel Denitrifying Methanotroph of the NC10 Phylum and Its Microcolony.” Scientific Reports 6 (1): 32241. https://doi.org/10.1038/srep32241.
Holthuijzen, Maike F., and Kari E. Veblen. 2015. “Grass-Shrub Associations over a Precipitation Gradient and Their Implications for Restoration in the Great Basin, USA.” PloS One 10 (12): e0143170. https://doi.org/10.1371/journal.pone.0143170.
“Onaqui NEON NSF NEON Open Data to Understand Our Ecosystems.” n.d. Accessed May 8, 2024. https://www.neonscience.org/field-sites/onaq.